TITLE: METHOD FOR ASSIGNING AN INDIVIDUAL TO A POPULATION OF ORIGIN BASED ON
MULTI-LOCUS GENOTYPES 

Background Of The Invention 

Throughout this application, various publications are referenced in parentheses by author and year. Full
citations for these references may be found at the end of the specification immediately preceding the claims. The
disclosures of these publications in their entireties are hereby incorporated by reference into this application to more
fully describe the state of the art to which this invention pertains.

The assignment of an individual to a population of origin based upon the individual's multi-locus genotype
is a statistical problem which must consider features of the genetic architecture of the underlying populations from
which the individual may have originated. For example, if there exist population specific alleles at certain loci (the
frequency of a population specific allele is zero in all but one of the populations), then the presence of at least one
of these alleles in the genotype of an individual indicates unequivocally the population to which the individual
belongs. Unfortunately, it is often difficult to establish that certain alleles are population specific, since their
absence in a sample of individuals from any one population may be either because the alleles are truly population
specific, or because the frequencies of these alleles are low and the sample obtained from any given population was
small. Clearly, the absence of an allele in a sample from a population does not justify the assumption that the allele
is not present in the population.

In the absence of a definitive marker for population of origin, or in the case where a genotype potentially
exists in more than one population, statistical approaches must be employed to identify the most likely population
of origin from among a set of "candidate populations." Further, these approaches must also evaluate the strength of
evidence for the individual belonging to the most likely population of origin over the other competing candidate
populations. Finally, the strength of evidence supporting the individual belonging to the most likely population of
origin against another novel population that is not represented among the set of candidate populations must also be
evaluated.

The present application discloses a statistical model for the assignment of individuals to a population of 
origin that possesses the following features: 

1. The approach assumes that samples of individuals are available from a number of candidate populations 
and that these individuals have been genotyped for a number of marker loci. 
2. There may be any number of candidate populations and each population may have a different sample size.

3. There may be any number of markers that have been genotyped in the individuals within each of the 
candidate populations. 

4. The individual to be assigned to a population of origin may have been genotyped for all, or only a subset of 
the marker loci. 

5. Marker loci genotypes in each candidate population are tested for conformance to Hardy-Weinberg
Equilibrium (HWE) and Gametic Phase Equilibrium (GPE) expectations. 

6. Under the null hypothesis that an individual belongs to any one given candidate population, the probability 
of the multi-locus genotype is computed for that population. 

7. The posterior probability of the individual belonging to each of the candidate populations is then calculated
utilizing any available prior knowledge concerning the population of origin.

8. The most likely population of origin of the tested individual is that population which possesses the greatest
posterior probability of origin. It is recommended that an individual be assigned to that population only when the 
posterior probability of origin exceeds a threshold, such as 80%. 

9. The percentage of genotypes more rare than the genotype of the individual in the most likely population of 
origin can be calculated or simulated in order to ascertain whether the individual may actually belong to a novel 
population not included in the set of candidate populations. 

This model has application, for example, in the livestock industry for assigning an individual animal to a
breed or to a population based on a desirable trait such as animal growth, quality grade, yield grade, marbling,
rib-eye muscle area, dressing percentage, or meat tenderness.



Summary Of The Invention 

The present invention provides a method of assigning an individual to a population of origin, which 
comprises: 

(a) identifying a set of candidate populations of origin, wherein each candidate population is characterized by 
genotype frequencies and allele frequencies at one or more marker loci;

(b) determining a population prior genotype probability for each individual and candidate population of origin 
using knowledge concerning the individual which is available prior to genotyping the individual; 

(c) genotyping the individual to identify the alleles at one or more of the marker loci identified in step (a) to 
thereby identify the individual's genotype; 

(d) based on the identified genotype of the individual, sequentially determining a population genotype
probability for each candidate population of origin under a null hypothesis that the individual arose from 
the population; 

(e) combining the population prior genotype probability from step (b) and the population genotype probability 
from step (d) to obtain a population posterior genotype probability for each candidate population of origin; 

(f) identifying a most likely population of origin wherein the population has the largest posterior genotype
probability among the set of candidate populations; and 

(g) assigning the individual to the population identified in step (f). 



Detailed Description Of The Invention 

The following definitions are presented as an aid in understanding this invention.

As used herein a marker locus is defined as a unique location on a chromosome (locus) within the nuclear 
genome of an individual, at which variation among chromosomes and individuals may be detected. Examples 
include but are not limited to microsatellite, Restriction Fragment Length Polymorphism (RFLP), Random 
Amplified Polymorphic DNA (RAPD), Variable Number of Tandem Repeat (VNTR), and Single Nucleotide 
Polymorphism (SNP) loci. Marker loci are usually named, and the name is expressed in italics. For example, AGLA17
is a microsatellite locus located at the centromeric end of chromosome 1 in cattle. 

An allele is a genetic variant at a marker locus detected on a single chromosome. For example, for the A
locus there may be n possible alleles and each allele is individually designated as $A_1, A_2, \ldots, A_n$.

The allele frequency is the frequency of an allele $A_i$ at the A locus within a specific population and is defined as
$p_{A_i}$.

Diploid means the nuclear genome of the individual possesses pairs of chromosomes, in which one 
chromosome of each pair is transmitted by each parent. Without loss of generality, the methodology described here 
will be for diploid species. 

Genotype is defined as the combination of alleles at a single locus that is found within an individual.
Genotypes at the A locus are of the form $A_iA_j$ for i and j between 1 and n. Individuals possessing two identical
alleles ($A_iA_i$) are called homozygotes and individuals possessing two different alleles ($A_iA_j$, $i \neq j$) are
called heterozygotes. Similarly, a multi-locus genotype is represented as the genotypes at each locus, e.g.,
$A_1A_2B_3B_3C_4C_5$.

Genotyping an individual means to analyze a sample of deoxyribonucleic acid (DNA) from the individual
to identify the alleles present at one or more marker loci. 

A haplotype is defined to be the set of alleles at multiple loci that are present in a gamete (sperm or ova).
If there are $n_a$, $n_b$ and $n_c$ alleles present at the A, B and C loci, haplotypes are represented as $A_iB_jC_k$
for $i = 1,\ldots,n_a$; $j = 1,\ldots,n_b$ and $k = 1,\ldots,n_c$.

Hardy-Weinberg Equilibrium (HWE) means that in a random mating population in which there is no
selection, migration, mutation or drift, population genotype frequencies occur as a simple function of allele
frequencies. Among homozygotes $F(A_iA_i) = p_{A_i}^2$ and among heterozygotes $F(A_iA_j) = 2p_{A_i}p_{A_j}$.
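
The Hardy-Weinberg relationship above can be illustrated with a short Python sketch. The function name `hwe_genotype_frequencies` and the example allele frequencies are hypothetical and are not taken from the specification; the code only evaluates the squared and doubled-product terms of the definition for one locus.

```python
from itertools import combinations_with_replacement


def hwe_genotype_frequencies(allele_freqs):
    """Expected genotype frequencies at one locus under HWE.

    allele_freqs maps allele name -> population allele frequency (summing to 1).
    Returns a dict keyed by unordered genotype (a, b) with a <= b.
    """
    freqs = {}
    for a, b in combinations_with_replacement(sorted(allele_freqs), 2):
        if a == b:
            # Homozygote: p^2
            freqs[(a, b)] = allele_freqs[a] ** 2
        else:
            # Heterozygote: 2 * p_a * p_b
            freqs[(a, b)] = 2 * allele_freqs[a] * allele_freqs[b]
    return freqs


# Hypothetical allele frequencies, chosen only for illustration.
example = hwe_genotype_frequencies({"A1": 0.5, "A2": 0.3, "A3": 0.2})
```

With n alleles the function returns n(n + 1)/2 genotype classes whose frequencies sum to one, which is a quick sanity check on any set of input allele frequencies.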

Gametic Phase Equilibrium (GPE): Two (or more) loci are defined as being in GPE if all population
haplotype frequencies occur as the product of individual allele frequencies, viz. $F(A_iB_jC_k) = p_{A_i}p_{B_j}p_{C_k}$
for $i = 1,\ldots,n_a$; $j = 1,\ldots,n_b$ and $k = 1,\ldots,n_c$. For loci that are in GPE, individual loci are in HWE,
and multi-locus genotype frequencies are obtained as the product of individual locus genotype frequencies. For example,

$$F(A_1A_2B_3B_3C_4C_5) = F(A_1A_2)\,F(B_3B_3)\,F(C_4C_5) = (2p_{A_1}p_{A_2})(p_{B_3}^2)(2p_{C_4}p_{C_5}) = 4p_{A_1}p_{A_2}p_{B_3}^2p_{C_4}p_{C_5}.$$

A candidate population is a population from which a sample of individuals has been genotyped for
multiple marker loci and sample allele frequencies have been determined for each locus.

Having due regard to the preceding definitions, the present invention concerns a method of assigning an
individual to a population of origin, which comprises:

(a) identifying a set of candidate populations of origin, wherein each candidate population is characterized by
genotype frequencies and allele frequencies at one or more marker loci;

(b) determining a population prior genotype probability for each individual and candidate population of origin
using knowledge concerning the individual which is available prior to genotyping the individual;

(c) genotyping the individual to identify the alleles at one or more of the marker loci identified in step (a) to
thereby identify the individual's genotype;

(d) based on the identified genotype of the individual, sequentially determining a population genotype
probability for each candidate population of origin under a null hypothesis that the individual arose from
the population;

(e) combining the population prior genotype probability from step (b) and the population genotype probability
from step (d) to obtain a population posterior genotype probability for each candidate population of origin;

(f) identifying a most likely population of origin wherein the population has the largest posterior genotype 
probability among the set of candidate populations; and 

(g) assigning the individual to the population identified in step (f). 

In one embodiment of the method, the individual is only assigned to the most likely population of origin if
the posterior genotype probability for the most likely population of origin exceeds a threshold value. In one
embodiment, the threshold value is determined empirically. In one embodiment, the threshold value is determined
using a sample of individuals from each candidate population who are independent of individuals used to
characterize each candidate population. In one embodiment, the threshold value is varied to determine the
percentage of individuals who a) cannot be classified to a population of origin, b) are correctly classified, and c) are
incorrectly classified.

In one embodiment, the method further comprises: 

(a) computing a probability with which genotypes rarer than the individual's genotype occur in the most likely 
population of origin; and 

(b) if the probability in step (a) is above a threshold value, assigning the individual to the population of origin
previously identified as the most likely population of origin, or if the probability in step (a) is not above a
threshold value, assigning the individual to a novel population that is not represented among the set of 
candidate populations of origin. 

In one embodiment, the threshold value is determined empirically. In one embodiment, the threshold
value is determined using a sample of individuals from each candidate population who are independent of
individuals used to characterize each candidate population. In one embodiment, the threshold value is varied to 
reduce the percentage of individuals who are incorrectly classified to a population. 

In one embodiment of the method, the population prior genotype probability is based on one or more 
morphological features of the individual. In a further embodiment, one or more morphological features allow the 
exclusion of one or more candidate populations of origin. In different embodiments, one or more morphological
features are selected from the group consisting of coat color, presence or absence of horns, and presence or absence 
of Bos indicus (humped or Zebu cattle) features such as a shoulder hump or a long, downswept ear. In a further 
embodiment, the coat color is black or nonblack. 

In one embodiment, the population prior genotype probability is set to equal a proportion of total 
population size that comprises each candidate population of origin. In another embodiment, the population prior
genotype probability is assumed to be uniform for each candidate population of origin. 

In one embodiment of the method, the marker locus genotypes for each candidate population of origin are
in Hardy-Weinberg Equilibrium and Gametic Phase Equilibrium. In other embodiments, the marker locus
genotypes for each candidate population of origin are not in Hardy-Weinberg Equilibrium or Gametic Phase
Equilibrium.

In one embodiment of the method, the individual is an animal. In further embodiments, the animal is a
cow, a heifer, a steer, a bull, a bullock, a pig, a horse, a fish, a chicken, a duck, a lamb, a shrimp, an oyster, a
mussel, or a shellfish.

In one embodiment of the method, the candidate population of origin is selected based on a desirable trait.
In further embodiments, the desirable trait is selected from the group consisting of one or more of animal growth,
quality grade, yield grade, marbling, rib-eye muscle area, dressing percentage, meat tenderness, meat flavor, meat
palatability, fatness, fat color, unsaturated fatty acid content of fat, reproductive efficiency, prolificacy, disease
resistance, feed conversion efficiency, drought tolerance, and heat tolerance. Marbling score in beef cattle is a
subjective score assigned by a United States Department of Agriculture (USDA) grader to a carcass based upon the
amount of intramuscular fat visualized in the longissimus dorsi muscle at the 12th to 13th rib juncture in properly
chilled carcasses (United States Standards for Grades of Carcass Beef, 1997). Ribeye muscle area is the
cross-sectional area of the longissimus dorsi muscle at the 11th to 12th rib juncture and is measured subjectively or by
means of a grid calibrated in tenths of an inch at the same time as the marbling score is obtained. Quality grade is
assigned by the USDA grader and is a combination of the marbling score and maturity (age) of the animal estimated
from the size, shape and ossification of the bones and cartilages (especially the split chine bones) and the color and
texture of the flesh. Younger animals (A maturity) are not penalized, but older animals (B maturity) have their
marbling scores down-rated into the quality grade. Yield grade is assigned by the USDA grader and is an estimate
of the yield of closely trimmed (1/2 inch fat or less), boneless retail cuts expected to be derived from the major
wholesale cuts (round, sirloin, short loin, rib, and square-cut chuck) of a carcass. The yield grade of a beef carcass
is determined by considering four characteristics: the amount of external fat; the amount of kidney, pelvic and heart
fat; the area of ribeye muscle; and the carcass weight. Carcasses possessing large amounts of exterior and interior
fat receive larger yield grade scores, indicating lower yields of lean meat. Dressing percentage is the ratio of hot
carcass weight (the eviscerated carcass) to live animal weight immediately preslaughter, expressed as a
percentage.

In one embodiment of the method, the candidate population of origin is selected based on an undesirable 
trait. In a further embodiment, the undesirable trait is toughness of meat. 

This invention will be better understood from the methodology and examples which follow. However, one 
skilled in the art will readily appreciate that the specific methods and examples discussed are merely illustrative of
the invention as described more fully in the claims which follow thereafter. 

Methodology 

Candidate Population Data 

Baseline data are gathered for each of the populations that are going to be candidates for the assignment of
individuals. The candidate set should represent all of the known extant populations. The process involves sampling
individuals from each population and genotyping them for the marker loci that are to be used in the classification
process. The larger the number of individuals in each sample the better; 50 individuals is a "reasonable" target.
Fewer individuals will sometimes be necessary for small populations.

The data that are collected on the individuals to characterize the candidate populations are used to: a)
estimate allele frequencies in the candidate population, b) estimate genotype frequencies in the candidate
population, and c) test to determine if the marker loci are in Hardy-Weinberg Equilibrium and Gametic Phase
Equilibrium in each of the candidate populations. The allele frequencies are estimated by counting the number of
alleles of each type that are present in the sample and expressing the totals as a proportion of the total number of
alleles in the sample. Similarly, the genotype frequencies for each marker locus are estimated by counting the
number of genotypes of each type that are present in the sample and expressing the totals as a proportion of the total
number of genotypes in the sample.

The numbers of alleles present in the sample from each population are tabulated. Let $N^i$ be the number of
alleles present in the sample of individuals which are known with certainty as having originated from the i-th
candidate population, $i = 1,\ldots,p$. In a diploid species, $N^i$ is twice the number of individuals in the sample and
the sample size may vary among the p populations. Suppose that a series of m marker loci are genotyped in all
individuals and populations. Within each population, the resulting genotype counts may be tested for
Hardy-Weinberg Equilibrium (HWE) and Gametic Phase Equilibrium (GPE) using the well known likelihood ratio or
$\chi^2$ "goodness of fit" tests, as described for example in Weir (1996). Without loss of generality, we assume that
each of the candidate populations is found to be in HWE and GPE. Further, the number of alleles present in the sample
for each marker locus and each population is tabulated as follows. Let $n^i_{A_j}$ be the number of $A_j$ alleles
detected in the sample from the i-th population. The data for locus A (and similarly for the remaining marker loci)
may be represented as:

Allele          A_1          A_2          ...   A_{n_a}          Total
Allele Count    n^i_{A_1}    n^i_{A_2}    ...   n^i_{A_{n_a}}    N^i

Note that $n^i_{A_j} = 0$ if the j-th allele at the A locus is not detected in the sample from the i-th population.

As an example, suppose that at the A locus we observe three alleles $A_1$, $A_2$ and $A_3$. The individuals are
diploid, so genotype is defined by the combination of two alleles that are present in any one individual. Assume
that when we genotype 120 individuals from a given candidate population, we observe the following:

Genotype      Number of individuals
A_1A_1        22
A_1A_2        12
A_1A_3        8
A_2A_2        40
A_2A_3        16
A_3A_3        22
Total         120



The genotype frequencies are obtained from the sample as the relative frequencies of the genotypes, so for
the $A_1A_1$ genotype we have a genotype frequency of 22/120 = 0.1833. To obtain the allele frequencies, we count
the number of alleles present in the sample using the genotype counts above. So there are 22 $A_1A_1$ individuals with
2 $A_1$ alleles, 12 $A_1A_2$ individuals with 1 $A_1$ allele and 8 $A_1A_3$ individuals with 1 $A_1$ allele. This gives
us a total of 2(22) + 12 + 8 = 64 $A_1$ alleles.
Therefore:

Allele        Number of alleles
A_1           64
A_2           108
A_3           68
Total         240

The total number of alleles is of course twice the total number of individuals.
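
The counting in this example can be reproduced with a few lines of Python. This is a minimal sketch: the helper names `allele_counts` and `frequencies` are illustrative and not part of the specification, while the genotype counts are those of the 120-individual example above.

```python
from collections import Counter

# Genotype counts from the worked example (120 diploid individuals).
genotype_counts = {
    ("A1", "A1"): 22, ("A1", "A2"): 12, ("A1", "A3"): 8,
    ("A2", "A2"): 40, ("A2", "A3"): 16, ("A3", "A3"): 22,
}


def allele_counts(genotype_counts):
    """Count alleles: each individual contributes both alleles of its genotype."""
    counts = Counter()
    for (a, b), n in genotype_counts.items():
        counts[a] += n
        counts[b] += n  # for homozygotes this adds the second copy
    return counts


def frequencies(counts):
    """Express counts as proportions of their total."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


counts = allele_counts(genotype_counts)
allele_freqs = frequencies(counts)
genotype_freqs = frequencies(genotype_counts)
```

Here `counts` holds 64, 108 and 68 copies of the three alleles (240 in total), and `genotype_freqs[("A1", "A1")]` equals 22/120.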

Candidate Population Prior Genotype Probabilities 

For each individual to be tested, prior probabilities are assigned for the probability of belonging to each of
the candidate populations. Prior probabilities are assigned based on knowledge that is available before the DNA
sample from an individual was analyzed for marker genotype information. If there is no prior information then each
population is assigned an equal prior probability of having given rise to the individual. Alternatively, certain
morphological data may be available on an individual which allow the exclusion of certain of the candidate
populations, in which case the prior probabilities of these populations for this individual are set to zero. If, for
example, the individual is a horned animal and only three out of ten candidate populations contain horned animals,
the prior probabilities for the seven non-horned populations would be set to zero and the prior probabilities for the
three horned populations would each be set to 1/3 in the absence of further information.

Let $P_{ij}$ represent the a priori or prior probability that the j-th individual originated from the i-th population.
If individuals are sampled at random with respect to population of origin, we should elect to set the population prior
probabilities equal to the proportion of the total population size that comprises each candidate population. If no
pre-existing information was available, we assume a non-informative or uniform prior of $P_{ij} = 1/p$ for
$i = 1,\ldots,p$. Hence the individual had an equal chance of originating from any of the p candidate populations when
a uniform prior is used.

Prior probabilities may differ for each individual that is to be tested, but in every case must sum to unity as
$\sum_{i=1}^{p} P_{ij} = 1$.

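
The assignment of priors can be sketched in Python as follows. The function `prior_probabilities`, its `excluded` and `weights` parameters, and the population labels are hypothetical names introduced for illustration only; the example reproduces the horned-animal case in which seven of ten candidate populations are excluded.

```python
def prior_probabilities(populations, excluded=(), weights=None):
    """Prior probability of each candidate population for one individual.

    populations: iterable of population labels.
    excluded:    populations ruled out by prior knowledge (prior set to zero).
    weights:     optional dict of relative population sizes; uniform if None.
    """
    excluded = set(excluded)
    raw = {}
    for pop in populations:
        if pop in excluded:
            raw[pop] = 0.0
        else:
            raw[pop] = 1.0 if weights is None else float(weights[pop])
    total = sum(raw.values())
    if total <= 0:
        raise ValueError("at least one candidate population must remain")
    # Normalise so that the priors sum to unity.
    return {pop: w / total for pop, w in raw.items()}


# Ten hypothetical populations, of which only three contain horned animals.
pops = [f"pop{i}" for i in range(1, 11)]
horned = {"pop1", "pop2", "pop3"}
priors = prior_probabilities(pops, excluded=set(pops) - horned)
```

Passing `weights` instead of relying on the uniform default corresponds to setting priors in proportion to population size.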

Candidate Population Genotype Probabilities 

Each individual is genotyped for the marker loci for which baseline information was gathered for each
candidate population. The individual's genotype probability is then estimated (using a maximum likelihood
approach) for each of the candidate populations.

Suppose that an individual that is to be assigned to a population is genotyped for the m marker loci and,
arbitrarily, the multi-locus genotype is determined to be $A_1A_2B_3B_3 \ldots M_4M_5$. The probability of this genotype
occurring in the i-th population is determined as follows:

1. Under the null hypothesis that the individual originated from the i-th population, the individual may be
incorporated into the sample data for this population and the allele counts at each locus updated. For example, at the
A locus, the sample counts for the i-th population become:




WO 02/38737 



PCT/US01/47521 



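Allele          A_1              A_2              A_3          ...   A_{n_a}          Total
Allele Count    n^i_{A_1} + 1    n^i_{A_2} + 1    n^i_{A_3}    ...   n^i_{A_{n_a}}    N^i + 2

Similarly, at the B locus, the sample counts for the i-th population become:

Allele          B_1          B_2          B_3              ...   B_{n_b}          Total
Allele Count    n^i_{B_1}    n^i_{B_2}    n^i_{B_3} + 2    ...   n^i_{B_{n_b}}    N^i + 2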

2. Under the assumption of Hardy-Weinberg Equilibrium in the i-th population, the maximum likelihood
estimate (MLE) of genotype frequency at each of the marker loci is obtained. For example, at the A locus, the MLE
of the probability of the $A_1A_2$ genotype, $F_i(A_1A_2)$, is $2(n^i_{A_1}+1)(n^i_{A_2}+1)/(N^i+2)^2$. At the B
locus, the MLE of the probability of the $B_3B_3$ genotype, $F_i(B_3B_3)$, is $(n^i_{B_3}+2)^2/(N^i+2)^2$.

3. Under the assumption of Gametic Phase Equilibrium in the i-th population, the MLE of the multi-locus
genotype frequency is obtained as the product of the genotype frequencies at each of the marker loci. Thus, the
MLE of the probability of the $A_1A_2B_3B_3 \ldots M_4M_5$ genotype in the i-th population,
$F_i(A_1A_2B_3B_3 \ldots M_4M_5)$, is:

$$\{2(n^i_{A_1}+1)(n^i_{A_2}+1)/(N^i+2)^2\}\,\{(n^i_{B_3}+2)^2/(N^i+2)^2\} \cdots \{2(n^i_{M_4}+1)(n^i_{M_5}+1)/(N^i+2)^2\}.$$

In general, let $G_j$ represent the multi-locus genotype of the j-th individual. Then $F_i(G_j)$ represents the
probability of the genotype of the j-th individual in the i-th population.
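
Steps 1 to 3 can be sketched in Python as below. The function names, the dictionary layout of the allele counts, and the B-locus counts are assumptions made for illustration; the A-locus counts reuse the earlier 240-allele example. The individual's two alleles are added to the sample counts before the Hardy-Weinberg genotype frequency is computed, and the per-locus values are multiplied under Gametic Phase Equilibrium.

```python
from collections import Counter


def locus_probability(genotype, allele_counts):
    """MLE of a one-locus genotype frequency under HWE, computed after the
    individual's two alleles have been added to the population sample."""
    a, b = genotype
    counts = Counter(allele_counts)
    counts[a] += 1
    counts[b] += 1
    total = sum(counts.values())  # N + 2
    if a == b:
        return (counts[a] / total) ** 2  # (n + 2)^2 / (N + 2)^2
    return 2 * counts[a] * counts[b] / total ** 2  # 2(n_a + 1)(n_b + 1) / (N + 2)^2


def multilocus_probability(genotype, population_counts):
    """Product of one-locus probabilities (GPE). Only loci typed in the
    individual are used, so a subset of the markers is acceptable."""
    prob = 1.0
    for locus, locus_genotype in genotype.items():
        prob *= locus_probability(locus_genotype, population_counts[locus])
    return prob


# Hypothetical sample counts for one candidate population (B counts are invented).
population = {
    "A": {"A1": 64, "A2": 108, "A3": 68},
    "B": {"B1": 100, "B2": 140},
}
individual = {"A": ("A1", "A2"), "B": ("B3", "B3")}
p = multilocus_probability(individual, population)
```

Because the individual is added to the sample first, an allele that was never observed in the population (here B3) yields a small but non-zero probability rather than excluding the population outright.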

If the candidate population is not in Hardy-Weinberg and Gametic Phase Equilibrium, the population
genotype frequency is estimated by the frequency of the individual's genotype in the sample from each candidate
population. This process involves sequentially adding the individual to the sample so that the frequency of any
genotype that was not present in the original sample is 1/(N+1) where N is the number of individuals sampled for
the population.
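
A minimal Python sketch of this non-equilibrium estimate follows. The function name and the convention of storing genotypes as sorted tuples are assumptions for illustration, and the counts are those of the earlier 120-individual example.

```python
def sample_genotype_frequency(genotype, genotype_counts):
    """Frequency of a genotype in a sample after adding the tested individual.

    genotype_counts maps sorted genotype tuples to the number of sampled
    individuals carrying that genotype.
    """
    n = sum(genotype_counts.values())  # N individuals in the original sample
    key = tuple(sorted(genotype))      # treat (A2, A1) and (A1, A2) alike
    return (genotype_counts.get(key, 0) + 1) / (n + 1)


sample = {
    ("A1", "A1"): 22, ("A1", "A2"): 12, ("A1", "A3"): 8,
    ("A2", "A2"): 40, ("A2", "A3"): 16, ("A3", "A3"): 22,
}
seen = sample_genotype_frequency(("A2", "A1"), sample)    # observed genotype
unseen = sample_genotype_frequency(("A1", "A4"), sample)  # not in the sample
```

For a genotype absent from the original sample the result is 1/(N+1), here 1/121, which matches the rule stated in the paragraph above.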

Candidate Population Posterior Genotype Probabilities 

The posterior probabilities that the individual belongs to each of the candidate populations are determined,
and the population with the largest probability is selected as the "most likely" population of origin.

The posterior probability of the individual's genotype originating from the i-th candidate population is
obtained by combining both the population prior genotype probabilities and the candidate population genotype
probabilities as follows:

$$\Phi_{ij} = \frac{P_{ij}\,F_i(G_j)}{\sum_{k=1}^{p} P_{kj}\,F_k(G_j)}$$

For the j-th individual, $\Phi_{ij}$ is computed for each population $i = 1,\ldots,p$ and the population with the
greatest $\Phi_{ij}$ value is the most likely population of origin among the set of candidate populations.
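
The combination of priors and genotype probabilities can be sketched in Python as follows. The sketch assumes that the combination is Bayes' rule, that is, each prior is multiplied by the corresponding genotype probability and the products are normalised over the candidate populations. The function name and the numerical values are hypothetical.

```python
def posterior_probabilities(priors, genotype_probs):
    """Posterior probability of each candidate population for one individual.

    priors:         dict population -> prior probability.
    genotype_probs: dict population -> probability of the individual's genotype.
    """
    joint = {pop: priors[pop] * genotype_probs[pop] for pop in priors}
    total = sum(joint.values())
    if total == 0:
        raise ValueError("genotype has zero probability in every candidate population")
    return {pop: value / total for pop, value in joint.items()}


# Hypothetical inputs: a uniform prior and invented genotype probabilities.
priors = {"pop1": 1 / 3, "pop2": 1 / 3, "pop3": 1 / 3}
genotype_probs = {"pop1": 2.0e-6, "pop2": 1.0e-7, "pop3": 4.0e-7}
posterior = posterior_probabilities(priors, genotype_probs)
most_likely = max(posterior, key=posterior.get)
```

With these inputs the posterior for `pop1` is 0.8, so `pop1` is the most likely population of origin but would fall short of a 0.90 threshold.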

Simply taking the population with the largest posterior probability as being the "most-likely" population of
origin may not be a very good decision rule. The additional steps to the procedure described below are designed to
help the user arrive at a decision rule that has quantified success and error rates.

A threshold value must be determined for the posterior probability in order to define a decision rule for
accepting an individual as originating in one of the candidate populations. For example, one may choose to accept
an individual as originating in a population if the posterior probability exceeds 0.90 for one population. This is
interpreted to mean that among the available candidate populations, there is a 90% chance the individual originated
from the most likely population and only a 10% chance of originating in any of the other populations. Individuals
that are hybrids typically produce approximately equal posterior probabilities of belonging to the two populations
that contributed the parents of the hybrid and thus these hybrid animals are generally not assigned to any one
population.

If a second set of samples of individuals from each of the populations is available or can be obtained that
are independent of the samples used to produce the baseline candidate population data, these individuals can be
genotyped and posterior probabilities can be calculated for each individual and each of the candidate populations. A
decision rule can then be empirically determined that meets the requirements of the user. For example, the user may
wish to ensure that 95% of the individuals that are assigned to a population are correctly assigned. Thus, one may
find that assigning an individual to a population only when the posterior probability of belonging to that population
is greater than 90% results in 95% of individuals being correctly assigned to their population of origin. By altering
the threshold for the posterior probability decision rule, one can determine the proportion of individuals that are
correctly classified, incorrectly classified and not classified respectively. Individuals for which the largest posterior
probability falls below the threshold are not assigned to a population.
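
The empirical tuning of the threshold can be sketched in Python as follows. The function `evaluate_threshold` and the four validation records are hypothetical; in practice the records would come from an independent sample of individuals of known origin whose posterior probabilities have been computed as described above.

```python
def evaluate_threshold(validation, threshold):
    """Proportion of validation individuals that are correctly assigned,
    incorrectly assigned, or left unassigned at a given posterior threshold.

    validation: list of (true_population, posterior_dict) pairs.
    """
    tallies = {"correct": 0, "incorrect": 0, "unassigned": 0}
    for true_pop, posterior in validation:
        best = max(posterior, key=posterior.get)
        if posterior[best] <= threshold:
            tallies["unassigned"] += 1  # largest posterior does not exceed threshold
        elif best == true_pop:
            tallies["correct"] += 1
        else:
            tallies["incorrect"] += 1
    n = len(validation)
    return {outcome: count / n for outcome, count in tallies.items()}


# Invented validation records, for illustration only.
validation = [
    ("pop1", {"pop1": 0.97, "pop2": 0.03}),
    ("pop1", {"pop1": 0.55, "pop2": 0.45}),
    ("pop2", {"pop1": 0.92, "pop2": 0.08}),
    ("pop2", {"pop1": 0.05, "pop2": 0.95}),
]
rates = evaluate_threshold(validation, 0.90)
```

Calling the function over a grid of thresholds gives the trade-off between the three outcomes, from which a user can pick the value that meets a target such as 95% correct among assigned individuals.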

Rarity of Genotype in Candidate Population 

Occasionally an individual may be incorrectly assigned to a population because the individual arose from a
population that was not represented in the group of candidate populations. In this case, the procedure described
above will identify the population that is most similar to the population from which the individual actually arose. If
the posterior probability is greater than the threshold (i.e., all of the remaining populations are quite different from
the population from which the individual actually arose), the individual will be incorrectly assigned. These cases of
incorrect assignment can be identified by calculating the probability of a rarer genotype in the assigned population.
If this probability is low, say 5% (this threshold is also user defined and can also be determined empirically), then
one might reject the individual from the population and change the classification of the individual to "unassigned."
The underlying logic here is that even though the individual shows strong evidence for belonging to only one of the
populations, it is actually a rare genotype in that population. From a statistical perspective, it is more likely that the
individual actually has a fairly common genotype in a population that is not represented among the candidate
populations.

If there are m marker loci and there are $n_k$ alleles at the k-th marker, there will be
$T = \prod_{k=1}^{m} n_k(n_k+1)/2$ possible multi-locus genotypes in any population. It does not require many loci or
many alleles at the individual marker loci for the total number of genotypes, T, to become very large. For example,
with m = 10 marker loci and $n_k = 6$ alleles at each locus (which is characteristic of microsatellite loci), there are
more than a trillion possible genotypes ($21^{10} \approx 1.7 \times 10^{13}$). In this
case, we estimate the frequency of a genotype that is rarer than the genotype present in the tested individual by
simulation. A large number of multi-locus genotypes (such as 100,000) is simulated by drawing alleles at random
from the most likely population using the relative allele frequencies for the population after adding the alleles of the
individual to the sample. The probability of each of the simulated genotypes is then computed as described above.
Finally, the percentage of simulated genotypes for which the multi-locus genotype frequency is lower than that of
the tested individual is calculated. If the percentage of rarer genotypes is low, say 5% or less, we might conclude
that the tested individual has a genotype that is too rare for it to truly have originated in the most likely population
and that the individual actually belongs to a novel population.
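
The simulation can be sketched in Python as follows. The function `simulate_rarity`, its parameters, and the one-locus example data are assumptions for illustration, and the default of 10,000 simulated genotypes is chosen only to keep the example fast (the text suggests a number such as 100,000). Allele frequencies are taken after adding the individual's alleles to the sample, genotypes are drawn assuming HWE and GPE, and the function returns the fraction of simulated genotypes that are strictly rarer than the tested genotype.

```python
import random


def simulate_rarity(individual, population_counts, n_sim=10_000, seed=0):
    """Estimate the fraction of genotypes rarer than the individual's genotype."""
    rng = random.Random(seed)

    # Allele frequencies per locus after adding the individual's two alleles.
    freqs = {}
    for locus, (a, b) in individual.items():
        counts = dict(population_counts[locus])
        counts[a] = counts.get(a, 0) + 1
        counts[b] = counts.get(b, 0) + 1
        total = sum(counts.values())
        freqs[locus] = {allele: c / total for allele, c in counts.items()}

    def probability(genotype):
        """Multi-locus genotype probability under HWE and GPE."""
        p = 1.0
        for locus, (a, b) in genotype.items():
            f = freqs[locus]
            p *= f[a] ** 2 if a == b else 2 * f[a] * f[b]
        return p

    target = probability(individual)
    # Pre-split each locus into parallel (alleles, weights) tuples for sampling.
    samplers = {locus: tuple(zip(*f.items())) for locus, f in freqs.items()}
    rarer = 0
    for _ in range(n_sim):
        simulated = {
            locus: tuple(rng.choices(alleles, weights=weights, k=2))
            for locus, (alleles, weights) in samplers.items()
        }
        if probability(simulated) < target:
            rarer += 1
    return rarer / n_sim


population = {"A": {"A1": 64, "A2": 108, "A3": 68}}
frac_common = simulate_rarity({"A": ("A2", "A2")}, population)
frac_rare = simulate_rarity({"A": ("A4", "A4")}, population)
```

In this one-locus example the common genotype has roughly 30% of genotypes rarer than it, whereas the homozygote for a previously unseen allele has none, so the second individual would be flagged as a possible member of a novel population under a 5% rule.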



Advantageous Features of the Approach 

The approach described herein provides certain advantages including, but not limited to, the following: 

1. The approach assumes that any allele that is present in an individual to be tested but that is absent from the 
sample for any one candidate population, is absent because it is a rare allele that was not captured in the sample 
rather than the allele being population specific. This approach loses statistical power in the sense that population 
specific alleles unequivocally eliminate from consideration any candidate population that does not possess the 
alleles. However, central to this argument is the fact that without very large samples of individuals from each of the 
candidate populations, it is impossible to discriminate between alleles that are rare and alleles that are absent from 
any population. Thus, the approach presented herein is conservative in that it will underestimate the posterior 
probability of population of origin when there are population specific alleles. On the other hand, our approach gains 
specificity in that populations are not rejected from consideration simply because an allele was not present in the 
sample of individuals drawn from the population. 

2. The approach recognizes that there may well be a number of potential populations that were not selected as 
candidates because they were not sampled in order to define them as a candidate population. There may well be 
individuals that are submitted for testing that originated in some novel and unsampled population. These individuals 
will have posterior probabilities of population of origin estimated by the procedure and will be assigned to the 
candidate population that is genetically most similar to the true population of origin. In some cases, the posterior 
probability for one candidate population may be very high, even though the individual did not originate from this 
population. In order to identify misclassifications, we estimate the cumulative probability distribution function for 
genotypes that are rarer than that of the individual to be classified. This allows estimation of the probability of a 
rarer genotype in the most likely population of origin of the individual. If this probability is low, perhaps 5% or 
less, one should conclude that the individual actually originated in a population not included in the candidate set, 
since only 5% of genotypes are rarer in the most likely of the candidate populations. 

3. The power of the approach depends on the number of marker loci that are typed in the individuals to be 
classified and the candidate populations and the degree to which marker allele frequencies are skewed among the 
candidate populations. However, the approach is able to utilize all available information. If certain individuals have 
been genotyped for only a subset of the available markers, the candidate population genotype probabilities and 
therefore the posterior probabilities are computed only for the available multi-locus marker genotype. 

4. The probability thresholds for assigning an individual to a population of origin based upon the posterior 
probability and for accepting the individual as truly belonging to the most likely population based upon the 
probability of a rarer genotype must be determined empirically. Preferably a second independent sample from each 
of the candidate populations should be genotyped and posterior probabilities computed for each candidate 
population. By varying an artificial posterior probability threshold for accepting an individual as belonging to the 
most likely population of origin, we can empirically determine the percentage of individuals that a) cannot be 
classified to a population of origin, b) are correctly classified, and c) are incorrectly classified. Among those 
individuals that are incorrectly classified to a population based upon the posterior probability, altering the 
acceptance threshold for the percentage of rarer genotypes further allows the reduction in the overall percentage of 
misclassified individuals. 
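
The empirical threshold determination in item 4 can be sketched as a simple sweep over candidate posterior probability thresholds. The sketch below is illustrative only: the validation records, the population labels, and the function name are hypothetical, and an individual is treated as classified only when the largest posterior probability exceeds the threshold.

```python
def sweep_posterior_threshold(validation, thresholds):
    """Tabulate classification outcomes for each candidate threshold.

    validation: list of (true_population, posteriors) pairs, where posteriors
                maps each candidate population to its posterior probability.
    Returns a list of (threshold, % unclassified, % correct, % incorrect).
    """
    n = len(validation)
    table = []
    for t in thresholds:
        unclassified = correct = incorrect = 0
        for true_pop, posteriors in validation:
            best = max(posteriors, key=posteriors.get)
            if posteriors[best] <= t:
                unclassified += 1      # a) cannot be classified
            elif best == true_pop:
                correct += 1           # b) correctly classified
            else:
                incorrect += 1         # c) incorrectly classified
        table.append((t, 100 * unclassified / n, 100 * correct / n, 100 * incorrect / n))
    return table


# Hypothetical second, independent sample with known populations of origin.
validation = [
    ("P1", {"P1": 0.95, "P2": 0.05}),
    ("P1", {"P1": 0.60, "P2": 0.40}),
    ("P2", {"P1": 0.85, "P2": 0.15}),
    ("P2", {"P1": 0.02, "P2": 0.98}),
]
table = sweep_posterior_threshold(validation, [0.5, 0.9])
```

With these made-up records, raising the threshold from 0.5 to 0.9 removes the one misclassification at the cost of leaving half of the individuals unclassified, which is the trade-off the threshold is meant to balance.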

Example 

Consider the following three candidate populations, which have been genotyped for two marker loci. The 
A locus has 2 alleles and the B locus 3 alleles if we ignore the subdivision into candidate populations. The 
individuals that were genotyped for the two loci from each of the candidate populations gave the following allele 
counts: 

Suppose that the first individual presented for classification to a population of origin has genotype 
G_1 = A_1A_1B_3B_3. The candidate population genotype probabilities are: 



F_1(G_1) = (22^2/42^2)(12^2/42^2) = .0224, 
F_2(G_1) = (22^2/52^2)(2^2/52^2) = .0003, 
F_3(G_1) = (52^2/62^2)(57^2/62^2) = .5946. 
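
The arithmetic above can be reproduced with a short sketch. It rests on stated assumptions: the counts in the formulas (22 of 42, 12 of 42, and so on) are taken to already include the two alleles per locus contributed by the tested individual, so the sample counts below are inferred by subtracting those alleles and are not copied from the allele count table; all alleles other than A_1 and B_3 are pooled under the hypothetical label "other"; and Hardy-Weinberg and Gametic Phase Equilibrium are assumed. Only the loci typed in the individual enter the product.

```python
def add_individual(counts, genotype):
    """Return allele counts at one locus after adding the tested individual's two alleles."""
    updated = dict(counts)
    for allele in genotype:
        updated[allele] = updated.get(allele, 0) + 1
    return updated


def locus_probability(counts, genotype):
    """Genotype probability at one locus from allele counts (assumes HWE)."""
    total = sum(counts.values())
    a, b = genotype
    p = (counts[a] / total) * (counts[b] / total)
    return p if a == b else 2.0 * p


def genotype_probability(sample, individual):
    """Multiply per-locus probabilities over the loci typed in the individual."""
    prob = 1.0
    for locus, genotype in individual.items():
        counts = add_individual(sample[locus], genotype)
        prob *= locus_probability(counts, genotype)
    return prob


# Inferred sample counts (assumption): formula counts minus the individual's alleles.
samples = {
    1: {"A": {"A1": 20, "other": 20}, "B": {"B3": 10, "other": 30}},
    2: {"A": {"A1": 20, "other": 30}, "B": {"B3": 0, "other": 50}},
    3: {"A": {"A1": 50, "other": 10}, "B": {"B3": 55, "other": 5}},
}
individual = {"A": ("A1", "A1"), "B": ("B3", "B3")}

F = {pop: genotype_probability(sample, individual) for pop, sample in samples.items()}
```

Under these assumptions the B_3 allele is absent from the second population's sample, yet the population is not eliminated: adding the individual's alleles gives it a small but non-zero genotype probability.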

We shall assume that this individual has an equal a priori chance of originating from any of the three 
populations and hence P_{11} = P_{21} = P_{31} = 1/3. 
The posterior probabilities of belonging to each of the candidate populations are: 

Population 1: (0.33*0.0224) / (0.33*0.0224 + 0.33*0.0003 + 0.33*0.5946) = .0363, 
Population 2: (0.33*0.0003) / (0.33*0.0224 + 0.33*0.0003 + 0.33*0.5946) = .0004, 
Population 3: (0.33*0.5946) / (0.33*0.0224 + 0.33*0.0003 + 0.33*0.5946) = .9633. 
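
The posterior calculation is an application of Bayes' rule and can be sketched as below. The function name is hypothetical, and the genotype probabilities are recomputed from the count ratios used in the example so that no rounding is carried into the result.

```python
def posterior_probabilities(priors, likelihoods):
    """Combine prior and genotype probabilities into posterior probabilities."""
    joint = {pop: priors[pop] * likelihoods[pop] for pop in likelihoods}
    total = sum(joint.values())
    return {pop: value / total for pop, value in joint.items()}


# Genotype probabilities for the three candidate populations in the example.
likelihoods = {
    1: (22 / 42) ** 2 * (12 / 42) ** 2,
    2: (22 / 52) ** 2 * (2 / 52) ** 2,
    3: (52 / 62) ** 2 * (57 / 62) ** 2,
}
priors = {pop: 1 / 3 for pop in likelihoods}   # equal a priori chance

posteriors = posterior_probabilities(priors, likelihoods)
most_likely = max(posteriors, key=posteriors.get)
```

Carrying the unrounded genotype probabilities through gives .0363, .0004 and .9633 for the three populations. Because the priors are equal, they cancel and the posteriors depend only on the genotype probabilities.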



Since the magnitude of the posterior probability for the third population exceeds a threshold (which we 
shall arbitrarily set at .90 for the purposes of this example), we can conclude at this stage that the individual 
originated either from the third candidate population or from a novel population genetically similar to the third 
candidate population. In order to discriminate between these two situations, we must compute the probability of a 
rarer genotype than A_1A_1B_3B_3 in the third candidate population. Since there are only 18 possible genotypes for this 
example, we do not need to simulate the distribution of genotypes and provide the distribution: 

The genotype distribution reveals that the A_1A_1B_3B_3 genotype is the most common genotype in the third 
candidate population and that fully 40.54% of individuals within this population have genotypes that are more rare 
than the genotype of the tested individual. Therefore we conclude that the tested individual originated from the third 
candidate population and not from a novel population that was similar in genetic structure. 
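
When the number of possible genotypes is small, the distribution can be enumerated exactly instead of simulated, as in the sketch below. The A_1 and B_3 frequencies follow the example (52/62 and 57/62); the split of the remaining B alleles between B_1 and B_2 is a hypothetical choice made only so that the code runs. Hardy-Weinberg and Gametic Phase Equilibrium are assumed, and the function names are illustrative.

```python
from itertools import combinations_with_replacement, product


def locus_genotypes(freqs):
    """Map each unordered genotype at one locus to its probability (assumes HWE)."""
    genotypes = {}
    for a, b in combinations_with_replacement(sorted(freqs), 2):
        p = freqs[a] * freqs[b]
        genotypes[(a, b)] = p if a == b else 2.0 * p
    return genotypes


def genotype_distribution(loci):
    """Enumerate every multi-locus genotype with its probability."""
    per_locus = [locus_genotypes(freqs) for freqs in loci]
    distribution = {}
    for combo in product(*(g.items() for g in per_locus)):
        key = tuple(genotype for genotype, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        distribution[key] = prob
    return distribution


# Third candidate population; the B1/B2 split (3 and 2 of 62) is hypothetical.
loci = [
    {"A1": 52 / 62, "A2": 10 / 62},
    {"B1": 3 / 62, "B2": 2 / 62, "B3": 57 / 62},
]
distribution = genotype_distribution(loci)
observed = distribution[(("A1", "A1"), ("B3", "B3"))]
rarer = sum(p for p in distribution.values() if p < observed)
```

Because A_1A_1B_3B_3 is the most common genotype, every other genotype is rarer, so `rarer` equals one minus its probability (about .4054) whatever split is chosen for B_1 and B_2.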

Applications 

The approach disclosed herein has application, for example, in the livestock industry where there is a need 
to be able to determine value differences in live animals due to the inherent genetic variation in the yield of tender 
and marbled beef from their carcasses. Packers are forced to sort through thousands of carcasses from animals 
slaughtered each day in order to identify those that meet the specifications of their customers. Due to the very high 
daily volume of slaughter animals and limited cooler space (which reduces ability to sort), packers are unable to 
efficiently market their inventory based upon quality specifications. Further, packers have no ability to discriminate 
among the carcasses that do not grade choice that could be marketed as a tender product. By and large, the variation 
in product specifications that the packers must manage each day correlates directly to the variation in the cattle 
received. 

Knowledge of an animal's underlying genetic predisposition to yield marbled and tender beef would allow 
the stratification of the existing commodity market to facilitate the management and marketing of animals based 
upon product specifications. As much as 50 percent of the variation in growth and carcass yield and quality 
attributes in cattle is determined by the additive effects of genes. The remaining variation is due to the environment 
that an animal is exposed to prior to entry to the feedlot and due to the management the animal receives during the 
feedlot and slaughter phases of production. Thus, up to 50 percent of the variation that currently exists within the 
commodity cattle market could be eliminated by grouping cattle according to their individual genotypes at entry 
into the feedlot. These animals could then be managed, fed and slaughtered as a uniform group and could then be 
marketed according to their quality attributes. This model would allow the creation of new "branded" products for 
the marketing of products such as lean and tender beef. 

The approach described in the present application can be applied not only to beef cattle but also to other 
livestock such as fish, pigs, chickens, lambs, shrimp, mussels, oysters, and shellfish. 

What is claimed is: 

1. A method of assigning an individual to a population of origin, which comprises: 

(a) identifying a set of candidate populations of origin, wherein each candidate population is 
characterized by genotype frequencies and allele frequencies at one or more marker loci; 

(b) determining a population prior genotype probability for each individual and candidate population 
of origin using knowledge concerning the individual which is available prior to genotyping the 
individual; 

(c) genotyping the individual to identify the alleles at one or more of the marker loci identified in step 
(a) to thereby identify the individual's genotype; 

(d) based on the identified genotype of the individual, sequentially determining a population 
genotype probability for each candidate population of origin under a null hypothesis that the 
individual arose from the population; 

(e) combining the population prior genotype probability from step (b) and the population genotype 
probability from step (d) to obtain a population posterior genotype probability for each candidate 
population of origin; 

(f) identifying a most likely population of origin wherein the population has the largest posterior 
genotype probability among the set of candidate populations; and 

(g) assigning the individual to the population identified in step (f). 

2. The method of claim 1, wherein the individual is only assigned to the most likely population of origin if 
the posterior genotype probability for the most likely population of origin exceeds a threshold value. 

3. The method of claim 1, which further comprises: 

(a) computing a probability with which genotypes rarer than the individual's genotype occur in the 
most likely population of origin; and 

(b) if the probability in step (a) is above a threshold value, assigning the individual to the population 
of origin previously identified as the most likely population of origin, or if the probability in step 
(a) is not above a threshold value, assigning the individual to a novel population that is not 
represented among the set of candidate populations of origin. 

4. The method of claim 2, wherein the threshold value is determined empirically. 

5. The method of claim 4, wherein the threshold value is determined using a sample of individuals from each 
candidate population who are independent of individuals used to characterize each candidate population. 

6. The method of claim 4, wherein the threshold value is varied to determine the percentage of individuals 
who a) cannot be classified to a population of origin, b) are correctly classified, and c) are incorrectly 
classified. 

7. The method of claim 3, wherein the threshold value is determined empirically. 

8. The method of claim 7, wherein the threshold value is determined using a sample of individuals from each 
candidate population who are independent of individuals used to characterize each candidate population. 

9. The method of claim 7, wherein the threshold value is varied to reduce the percentage of individuals who 
are incorrectly classified to a population. 

10. The method of claim 1, wherein the population prior genotype probability is based on one or more 
morphological features of the individual. 

11. The method of claim 10, wherein one or more morphological features allow the exclusion of one or more 
candidate populations of origin. 

12. The method of claim 11, wherein one or more morphological features are selected from the group 
consisting of coat color, presence or absence of horns, presence or absence of a shoulder hump, and 
presence or absence of a long, downswept ear. 

13. The method of claim 12, wherein the coat color is black or nonblack. 

14. The method of claim 1, wherein the population prior genotype probability is set to equal a proportion of 
total population size that comprises each candidate population of origin. 

15. The method of claim 1, wherein the population prior genotype probability is assumed to be uniform for 
each candidate population of origin. 

16. The method of claim 1, wherein marker locus genotypes for each candidate population of origin are in 
Hardy-Weinberg Equilibrium and Gametic Phase Equilibrium. 

17. The method of claim 1, wherein marker locus genotypes for each candidate population of origin are not in 
Hardy-Weinberg Equilibrium or Gametic Phase Equilibrium. 

18. The method of claim 1, wherein the individual is an animal. 

19. The method of claim 18, wherein the animal is a cow, a heifer, a steer, a bull, a bullock, a pig, a horse, a 
fish, a chicken, a duck, a lamb, a shrimp, an oyster, a mussel, or a shellfish. 

20. The method of claim 1, wherein the candidate population of origin is selected based on a desirable trait. 

21. The method of claim 20, wherein the desirable trait is selected from the group consisting of one or more of 
animal growth, quality grade, yield grade, marbling, rib-eye muscle area, dressing percentage, meat 
tenderness, meat flavor, meat palatability, fatness, fat color, unsaturated fatty acid content of fat, 
reproductive efficiency, prolificacy, disease resistance, feed conversion efficiency, drought tolerance, and 
heat tolerance. 

22. The method of claim 1, wherein the candidate population of origin is selected based on an undesirable trait. 

23. The method of claim 22, wherein the undesirable trait is toughness of meat. 